Resource recovery for checkpoint-based high-availability in a virtualized environment

ABSTRACT

A computer-implemented method provides checkpoint-based high availability for an application in a virtualized environment with reduced network demands. An application executes on a primary host machine comprising a first virtual machine. A virtualization module receives a designation from the application of a portion of the memory of the first virtual machine as purgeable memory, wherein the purgeable memory can be reconstructed by the application when the purgeable memory is unavailable. Changes are tracked to a processor state and to a remaining portion of the memory that is not purgeable memory, and the changes are periodically forwarded at checkpoints to a secondary host machine. In response to an occurrence of a failure condition on the first virtual machine, the secondary host machine is signaled to continue execution of the application by using the forwarded changes to the remaining portion of the memory and by reconstructing the purgeable memory.

PRIORITY CLAIM

The present application is a continuation of U.S. patent application Ser. No. 13/253,519, titled “Resource Recovery for Checkpoint-Based High-Availability in a Virtualized Environment,” filed on Oct. 5, 2011, the contents of which are incorporated herein by reference in their entirety.

BACKGROUND

1. Technical Field

The present disclosure relates to a virtualized environment for a data processing system, and more particularly to resource recovery on a secondary host machine in the event of a failure condition of a virtual machine operating on a primary host machine.

2. Description of the Related Art

Checkpoint-based high-availability is a technique whereby a virtual machine running on a host machine (the “Primary Host”) regularly (e.g., every 25 ms) mirrors its Central Processing Unit (CPU) and memory state onto another host machine (the “Secondary Host”). This mirroring process involves: 1. tracking changes to the memory and processor state of the virtual machine; 2. periodically stopping the virtual machine; 3. sending these changes over a network to the secondary host; 4. waiting for the secondary host to acknowledge receipt of the memory and CPU state update; and 5. resuming the virtual machine.
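The five mirroring steps can be illustrated with a minimal in-memory sketch. The class names PrimaryVM and SecondaryHost and the checkpoint function are hypothetical names chosen for illustration, and the network transfer is modeled as a direct method call rather than an actual network operation.

```python
from dataclasses import dataclass, field


@dataclass
class SecondaryHost:
    """Hypothetical stand-in for the mirrored state held by the secondary host."""
    memory: dict = field(default_factory=dict)
    cpu_state: dict = field(default_factory=dict)

    def receive(self, dirty_pages: dict, cpu_state: dict) -> bool:
        # Apply the update and acknowledge receipt (step 4).
        self.memory.update(dirty_pages)
        self.cpu_state = dict(cpu_state)
        return True


@dataclass
class PrimaryVM:
    """Hypothetical stand-in for the virtual machine on the primary host."""
    memory: dict = field(default_factory=dict)
    cpu_state: dict = field(default_factory=dict)
    dirty: set = field(default_factory=set)
    running: bool = True

    def write(self, page: int, data: bytes) -> None:
        self.memory[page] = data
        self.dirty.add(page)  # step 1: track changes to memory


def checkpoint(vm: PrimaryVM, secondary: SecondaryHost) -> int:
    """Run one checkpoint and return the number of pages sent."""
    vm.running = False                                  # step 2: stop the VM
    changes = {page: vm.memory[page] for page in vm.dirty}
    acked = secondary.receive(changes, vm.cpu_state)    # steps 3 and 4
    if not acked:
        raise RuntimeError("secondary host did not acknowledge the update")
    vm.dirty.clear()
    vm.running = True                                   # step 5: resume the VM
    return len(changes)
```

In this sketch a second checkpoint with no intervening writes sends zero pages, which reflects that only tracked changes are forwarded.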

The mirroring process ensures that the secondary host is able to resume the workload with no loss of service should the primary host suffer a sudden hardware failure. If the secondary host either notices that the primary host is not responding, or receives an explicit notification from the primary host, the secondary host starts the mirrored version of the virtual machine. The effect to the outside world is that the virtual machine has seamlessly continued to execute across the failure of the primary host.

One of the key performance bottlenecks in this process is the rate at which pages of modified memory must be transferred from the primary host to the secondary host during execution. In all implementations of this technology today, modifications to memory can only be detected with page-level granularity, which is typically at least 4 kilobytes (KB). The hypervisor achieves this by marking all memory used by the virtual server as read-only following every checkpoint, and detecting the faults that occur when the virtual server attempts to write to a page of memory. The hypervisor can then record that the page has been modified and must therefore be transferred at the next checkpoint. Then the hypervisor can remove the write-protection so that future writes to that page do not cause a fault. At the next checkpoint, the memory is re-protected and the list of modified pages is cleared.
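The write-protection scheme described above can be sketched as a simulation. The PageTracker class is a hypothetical illustration, the page size of 4096 bytes is an assumption, and the write fault is modeled as an ordinary branch rather than a hardware trap.

```python
PAGE_SIZE = 4096  # assumed page size in bytes, for illustration only


class PageTracker:
    """Simulates dirty-page tracking through write protection."""

    def __init__(self, num_pages: int) -> None:
        self.memory = bytearray(num_pages * PAGE_SIZE)
        self.write_protected = [True] * num_pages
        self.modified = set()
        self.fault_count = 0

    def write(self, address: int, value: int) -> None:
        page = address // PAGE_SIZE
        if self.write_protected[page]:
            # Simulated write fault: record the page as modified, then
            # remove the protection so later writes do not fault again.
            self.fault_count += 1
            self.modified.add(page)
            self.write_protected[page] = False
        self.memory[address] = value

    def checkpoint(self) -> list:
        """Return the modified pages, clear the list, and re-protect memory."""
        pages = sorted(self.modified)
        self.modified.clear()
        self.write_protected = [True] * len(self.write_protected)
        return pages
```

Two writes to the same page within one interval incur a single simulated fault, while the first write after a checkpoint faults again because the page has been re-protected.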

The cost of this approach is therefore at least twofold. First, the first write to any page in a given checkpoint interval (the space between two checkpoints) causes a fault that must be handled before the workload can resume. Second, the page must be transferred to the secondary host, which consumes network bandwidth.
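The two costs can be put in rough arithmetic terms. The following is a back-of-the-envelope sketch only; the function name interval_cost and the per-fault cost parameter are assumptions introduced for illustration, and any input values supplied to it are illustrative rather than measured.

```python
def interval_cost(dirty_pages: int, page_size_bytes: int,
                  interval_ms: int, fault_cost_us: int):
    """Estimate the per-interval fault overhead and the sustained bandwidth.

    fault_cost_us is an assumed cost of handling one write fault,
    in microseconds.
    """
    # First cost: one fault per page that is written in the interval.
    fault_overhead_us = dirty_pages * fault_cost_us
    # Second cost: every modified page is sent once per interval.
    bytes_per_second = dirty_pages * page_size_bytes * 1000 / interval_ms
    return fault_overhead_us, bytes_per_second
```

Both quantities scale linearly with the number of pages modified per interval, which is why reducing the set of tracked pages reduces both costs.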

BRIEF SUMMARY

Disclosed are a method, a computer program product, and a data processing system that supports high availability of an application executed in a host virtual machine by forwarding at checkpoints changes in the host virtual machine to a secondary virtual machine. An amount of data required to be forwarded for failing over to the secondary virtual machine is reduced by not forwarding portions of memory of the first virtual machine that are deemed to be “not essential” (or purgeable) by the application. In particular, the application deems the portions of memory which the application can reconstruct as not essential and/or purgeable.

In one aspect, the present disclosure further provides a computer-implemented method for resource recovery. A processor executes an application on a primary host machine comprising a first virtual machine with the processor and a memory. A designation is received from the application of a portion of the memory of the first virtual machine as purgeable memory, wherein the purgeable memory can be reconstructed by the application when the purgeable memory is unavailable. Changes are tracked to a processor state and to a remaining portion of the memory assigned to the application that is not designated by the application as purgeable memory. The first virtual machine is periodically stopped. In response to stopping the first virtual machine, the changes to the remaining (essential) portion of the memory are forwarded to a secondary host machine comprising a second virtual machine. In response to completing the forwarding of the changes to the essential memory portions, execution of the first virtual machine is resumed. In response to an occurrence of a failure condition on the first virtual machine, the secondary host machine is signaled to continue execution of the application by using the forwarded changes to the remaining portion of the memory and by the application reconstructing the purgeable memory.

The above as well as additional objectives, features, and advantages of the present innovation will become apparent in the following detailed written description.

BRIEF DESCRIPTION OF THE DRAWINGS

The description of the illustrative embodiments is to be read in conjunction with the accompanying drawings, wherein:

FIG. 1 provides a block diagram representing a mirrored virtualized data processing system (DPS) environment that supports resource recovery, according to one or more embodiments;

FIG. 2 provides a flow diagram illustrating a methodology for resource recovery within a mirrored virtualized DPS environment of FIG. 1, according to one or more embodiments;

FIG. 3 provides a flow diagram illustrating a methodology for designating memory as purgeable or not purgeable within a mirrored virtualized data processing system (DPS), according to one or more embodiments;

FIG. 4 provides a flow diagram illustrating a methodology for accessing memory designated as purgeable within a mirrored virtualized data processing system (DPS), according to one or more embodiments; and

FIGS. 5A-5D provide a sequence of diagrams illustrating operation of an Application Programming Interface (API) for protecting, accessing, and recovering from failures associated with purgeable memory of an application or program executed on a virtual machine, according to one or more embodiments.

DETAILED DESCRIPTION

The embodiments presented herein provide a method, a computer program product, and a data processing system that supports high availability of an application executed in a host virtual machine by forwarding at checkpoints changes to processor states and memory in the host virtual machine to a secondary virtual machine. The application deems the portions of memory which the application can reconstruct as not essential and/or purgeable. The amount of memory data required to be forwarded during the checkpoints to enable failing over to the secondary virtual machine is reduced by forwarding only portions of memory that are deemed essential by the application. The portions of memory of the first virtual machine that are deemed to be “not essential” (or purgeable) by the application are not forwarded and are reconstructed at the secondary virtual machine following a failover.

In the following detailed description of exemplary embodiments of the innovation, specific exemplary embodiments in which the innovation may be practiced are described in sufficient detail to enable those skilled in the art to practice the innovation, and it is to be understood that other embodiments may be utilized and that logical, architectural, programmatic, mechanical, electrical and other changes may be made without departing from the spirit or scope of the present innovation. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope of the present innovation is defined by the appended claims and equivalents thereof.

Within the descriptions of the figures, similar elements are provided similar names and reference numerals as those of the previous figure(s). Where a later figure utilizes the element in a different context or with different functionality, the element is provided a different leading numeral representative of the figure number. The specific numerals assigned to the elements are provided solely to aid in the description and not meant to imply any limitations (structural or functional or otherwise) on the described embodiment.

It is understood that the use of specific component, device and/or parameter names (such as those of the executing utility/logic described herein) is for example only and not meant to imply any limitations on the described embodiments. The presented embodiments may thus be implemented with different nomenclature/terminology utilized to describe the components/devices/parameters herein, without limitation. Each term utilized herein is to be given its broadest interpretation given the context in which that term is utilized.

As further described below, implementation of the functional features of the innovation is provided within processing devices/structures and involves use of a combination of hardware, firmware, as well as several software-level constructs (e.g., program code). The presented figures illustrate both hardware components and software components within example data processing.

Aspects of the illustrative embodiments provide a computer-implemented method, data processing system and computer program product by which a programmable application program interface (API) provides protection for portions of memory for an application executing on a primary virtual machine and implements a recovery mechanism when the memory is inaccessible following a failover to a secondary virtual machine. Further, the present innovation extends to both the operating system (OS) and the application a mechanism for enabling notification of reconstructable memory in a checkpoint based high availability system.

According to one or more embodiments, an API enables an application to mark regions of the application's memory as purgeable. This mark is used to indicate regions of memory which, if unavailable, the application would be able to recover. Examples of such regions of memory include, but are not limited to caches and other disposable components that the application can reconstruct. The information/mark that identifies purgeable memory is propagated to an underlying operating system or hypervisor to prevent transmission to the secondary host machine during a checkpointing operation on the primary host machine, thereby reducing a transmission burden across/on the network.

If a virtual machine, such as a virtual server, fails over to the secondary host machine due to hardware failure on the primary host machine, this purgeable memory, such as purgeable pages, is not mapped in by the hypervisor and thus accesses to the purgeable pages by the application on the secondary host machine will cause a fault. The application responds to this fault by re-mapping memory and recreating any of the required caches on the secondary host machine.

It is appreciated that the computing environment in which the described embodiments can be practiced can be a cloud computing environment. Cloud computing refers to Internet-based computing where shared resources, software, and information are provided to users of computer systems and other electronic devices (e.g., mobile phones) on demand, similar to the electricity grid. Adoption of cloud computing has been aided by the widespread utilization of virtualization, which is the creation of a virtual (rather than actual) version of something, e.g., an operating system, a server, a storage device, network resources, etc. A virtual machine (VM) is a software implementation of a physical machine (e.g., a computer system) that executes instructions like a physical machine. VMs are usually categorized as system VMs or process VMs. A system VM provides a complete system platform that supports the execution of a complete operating system (OS). In contrast, a process VM is usually designed to run a single program and support a single process. A VM characteristic is that application software running on the VM is limited to the resources and abstractions provided by the VM. System VMs (also referred to as hardware VMs) allow the sharing of the underlying physical machine resources between different VMs, each of which executes its own OS. The software that provides the virtualization and controls the VMs is typically referred to as a VM monitor (VMM) or hypervisor. A hypervisor may run on bare hardware (Type 1 or native VMM) or on top of an operating system (Type 2 or hosted VMM).

Cloud computing provides a consumption and delivery model for information technology (IT) services based on the Internet and involves over-the-Internet provisioning of dynamically scalable and usually virtualized resources. Cloud computing is facilitated by ease-of-access to remote computing websites (e.g., via the Internet or a private corporate network) and frequently takes the form of web-based tools or applications that a cloud consumer can access and use through a web browser, as if the tools or applications were a local program installed on a computer system of the cloud consumer. Commercial cloud implementations are generally expected to meet quality of service (QoS) requirements of consumers and typically include service level agreements (SLAs). Cloud consumers avoid capital expenditures by renting usage from a cloud vendor (i.e., a third-party provider). In a typical cloud implementation, cloud consumers consume resources as a service and pay only for resources used.

With reference now to the figures, and beginning with FIG. 1, there is depicted a block diagram representation of an example mirrored virtualized data processing system (DPS) environment/architecture, as utilized within one embodiment. The data processing system is described as having features common to a server computer. However, as used herein, the term “data processing system,” is intended to include any type of computing device or machine that is capable of receiving, storing and running a software product, including not only computer systems, but also devices such as communication devices (e.g., routers, switches, pagers, telephones, electronic books, electronic magazines and newspapers, etc.) and personal and home consumer devices (e.g., handheld computers, Web-enabled televisions, home automation systems, multimedia viewing systems, etc.).

FIG. 1 and the following discussion are intended to provide a brief, general description of an exemplary data processing system architecture adapted to implement the described embodiments. While embodiments will be described in the general context of instructions residing on hardware within a server computer, those skilled in the art will recognize that embodiments may be implemented in a combination of program modules running in an operating system. Generally, program modules include routines, programs, components, and data structures, which perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote memory storage devices.

A primary host machine such as DPS 100 can include hardware 102 including one or more processing units (“processor”) 104, a system memory 106, cache memory 108 for high speed storage of frequently used data, storage 110 such as hard drives, an input output adapter 112 and a network interface 114. Cache memory 108 can be connected to or communicatively coupled with processor 104 or can operatively be a part of system memory 106.

Primary host machine DPS 100 further includes computer readable storage media (“storage”) 110, such as one or more hard disk drives and one or more user interface devices. For instance, disk drives and user interface devices can be communicatively coupled to system interconnect fabric by an input-output (I/O) adapter 112. Disk drives provide nonvolatile storage for primary host machine DPS 100. User interface devices allow a user to provide input and receive output from primary host machine DPS 100. For example, user interface devices can include displays, keyboards and pointing devices such as a mouse. Although the description of computer readable storage media above refers to a hard disk, it should be appreciated by those skilled in the art that other types of media which are readable by a computer, such as removable magnetic disks, CD-ROM disks, magnetic cassettes, flash memory cards, digital video disks, Bernoulli cartridges, and other later-developed hardware, may also be used in the exemplary computer operating environment.

Primary host machine DPS 100 may operate in a networked environment 118 using logical connections to one or more remote computers or hosts, such as secondary host machine 120. Secondary host machine 120 may be a computer, a server, a router or a peer device and typically includes many or all of the elements described relative to primary host machine DPS 100. In a networked environment, program modules employed by primary host machine DPS 100, or portions thereof, may be stored in a remote memory storage device. The logical connections depicted in FIG. 1 include connections over a network 122. In an embodiment, network 122 may be a local area network (LAN). In alternative embodiments, network 122 may include a wide area network (WAN). Primary host machine DPS 100 is connected to network 122 through an input/output interface, such as the network interface 114. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used.

Those of ordinary skill in the art will appreciate that the hardware components and basic configuration depicted in FIG. 1 may vary. The illustrative components within primary host machine DPS 100 are not intended to be exhaustive, but rather are representative to highlight essential components that are utilized to implement the present invention. For example, other devices/components may be used in addition to or in place of the hardware depicted. The depicted example is not meant to imply architectural or other limitations with respect to the presently described embodiments and/or the general invention. The data processing system depicted in FIG. 1 may be, for example, an IBM eServer pSeries system, a product of International Business Machines Corporation in Armonk, N.Y., running the Advanced Interactive Executive (AIX) operating system or LINUX operating system.

FIG. 1 illustrates multiple virtual machines operating on the primary and secondary host machine data processing systems 100 in a logically partitioned system according to an embodiment. Primary host machine DPS 100 includes several virtual machines (VM) or logical partitions (LPAR) such as VM 1 (LPAR1) 124 and VM 2 (LPAR2) 126. While two virtual machines or logical partitions are illustrated, many additional virtual machines or logical partitions can be used in DPS 100. Each of logical partitions VM 1 (LPAR1) 124 and VM 2 (LPAR2) 126 is a division of resources of primary host machine DPS 100.

Each LPAR such as VM 1 (LPAR1) 124 comprises a virtual processor (“Central Processing Unit (CPU)”) 128, virtual memory 130, virtual firmware 132 and virtual storage 134. VM 1 (LPAR1) 124 further includes functional modules or software modules such as virtual operating system (OS) 136 and application software (“application”) 138. Application 138 is executed at least within logical partition VM 1 (LPAR1) 124. VM 1 (LPAR1) 124 and VM 2 (LPAR2) 126 operate under the control of a hypervisor 140. The VMs (LPARs) 124, 126 can communicate with each other and with hypervisor 140. Hypervisor 140 manages interaction between and allocates resources between VM 1 (LPAR1) 124 and VM 2 (LPAR2) 126 and virtual processors such as virtual processor 128. Hypervisor 140 controls the operation of VMs (LPARs) 124, 126, allows multiple operating systems 136 to run, unmodified, at the same time on primary host machine DPS 100, and provides a measure of robustness and stability to the system. Each operating system 136, and thus application 138, within the hypervisor 140 operates independently of the others, such that if one VM, such as VM 1 (LPAR1) 124, experiences a failure, the other VMs, such as VM 2 (LPAR2) 126, can continue working without interruption.

Similarly, secondary host machine DPS 120 can include several virtual machines or logical partitions (LPAR). For clarity, only one VM 1′ (LPAR1′) 124′ is depicted. Secondary host machine DPS 120 can include identical or similar hardware 102 as primary host machine DPS 100. While one logical partition is illustrated, many additional logical partitions can be used in secondary host machine DPS 120. Each logical partition VM 1′ (LPAR1′) 124′ is a division of resources of secondary host machine DPS 120. The prime annotation denotes that the logical partition VM 1′ (LPAR1′) 124′ and its constituent components are intended to be essentially identical to the VM 1 (LPAR1) 124 on the primary host machine DPS 100, at least as of a previous checkpoint 142 in time, with a notable exception disclosed herein. In particular, the hypervisor 140 of the primary host machine DPS 100 comprises a virtualization module 144 for resource recovery that periodically forwards information regarding the VM 1 (LPAR1) 124 sufficient for the corresponding virtualization module 144 of hypervisor 140 of the secondary host machine DPS 120 to create and maintain a version of the VM 1 (LPAR1) 124. Network 122 allows the hypervisors 140 to communicate with each other and to transfer data and operating parameters. For instance, changes 146 for VM 1 (LPAR1) 124 at each checkpoint 142 can be transmitted. Thus, the version of the application 138′ executing on VM 1′ (LPAR1′) 124′ can seamlessly take over execution of the application 138 for a client system 149 should the secondary host machine DPS 120 detect or receive failover signaling 148 from the primary host machine DPS 100 when a VM fault 150 is detected.

In an exemplary aspect, the virtualization module 144 of the primary host machine DPS 100 performs functions to provide checkpoint-based high availability while mitigating an amount of data that is sent over the network 122. The virtualization module 144 receives a designation from the application 138 of a portion of the memory 130 of the first virtual machine (VM 1 (LPAR1) 124) as purgeable memory 152, wherein the purgeable memory 152 can be reconstructed by the application 138 when the purgeable memory 152 is unavailable. The virtualization module 144 tracks changes to a processor state and to a remaining portion (essential memory 154) of the memory 130 assigned to the application 138 that is not designated by the application 138 as purgeable memory 152. The virtualization module 144 periodically stops the first virtual machine (VM 1 (LPAR1) 124). In response to stopping the first virtual machine (VM 1 (LPAR1) 124), the virtualization module 144 forwards the changes to the remaining portion (essential memory 154) of the memory 130 to a secondary host machine 120 comprising a second virtual machine (VM 1′ (LPAR1′) 124′). In response to completing the forwarding of the changes, the virtualization module 144 resumes execution of the first virtual machine (VM 1 (LPAR1) 124). In response to an occurrence of a failure condition on the first virtual machine (VM 1 (LPAR1) 124), the virtualization module 144 signals the secondary host machine 120 to continue execution of the application 138 by using the forwarded changes to the remaining portion (essential memory 154) of the memory 130 and by reconstructing the purgeable memory 152.

In an exemplary aspect, an Application Program Interface (API) 156 resident in memory 130 provides memory protection to the essential memory 154 and the purgeable memory 152 by setting access codes 158, 160 respectively for essential memory 154 and purgeable memory 152 for conveying what is to be forwarded to the hypervisor 140.

Referring now to FIG. 2, the present disclosure provides a computer-implemented method 200 for resource recovery for a high availability of an application executed in a virtualized environment while reducing data traffic on a network. A processor executes an application on a primary host machine comprising a first virtual machine with the processor and a memory (block 202). A designation is received from the application of a portion of the memory of the first virtual machine as purgeable memory (block 204). The purgeable memory can be reconstructed by the application when the purgeable memory is unavailable and thus is not essential. Changes to a processor state and to a remaining portion of the memory assigned to the application that is not designated by the application as purgeable memory are tracked (block 206). The first virtual machine is periodically stopped (block 208). In response to stopping the first virtual machine, the changes to the remaining portion of the memory are forwarded to a secondary host machine comprising a second virtual machine. In response to completing the forwarding of the changes, execution of the first virtual machine is resumed (block 210). In response to an occurrence of a failure condition on the first virtual machine, the secondary host machine is signaled to continue execution of the application by using the forwarded changes to the remaining portion of the memory and by reconstructing the purgeable memory (block 212).
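The flow of method 200 can be illustrated with a simplified sketch. The class MirroredVM, the function fail_over, and the rebuild callback are hypothetical names; the secondary host is modeled as a plain dictionary, and the purgeable memory is rebuilt eagerly at failover for brevity, whereas the embodiments described elsewhere herein reconstruct it when it is next accessed.

```python
class MirroredVM:
    """Hypothetical model of the first virtual machine with purgeable memory."""

    def __init__(self) -> None:
        self.memory = {}
        self.purgeable = set()
        self.dirty = set()

    def mark_purgeable(self, page) -> None:
        # Block 204: the application designates a page as purgeable.
        self.purgeable.add(page)
        self.dirty.discard(page)

    def write(self, page, data) -> None:
        self.memory[page] = data
        if page not in self.purgeable:
            self.dirty.add(page)  # block 206: track essential memory only

    def checkpoint(self, secondary_memory: dict) -> dict:
        # Blocks 208 and 210: stop, forward essential changes, resume.
        changes = {page: self.memory[page] for page in self.dirty}
        secondary_memory.update(changes)
        self.dirty.clear()
        return changes


def fail_over(secondary_memory: dict, purgeable_pages, rebuild) -> dict:
    """Block 212: continue from forwarded state and rebuild purgeable pages.

    rebuild is an application-supplied function (an assumption of this
    sketch) that recomputes the content of one purgeable page.
    """
    resumed = dict(secondary_memory)
    for page in purgeable_pages:
        resumed[page] = rebuild(page)
    return resumed
```

A typical use marks a derived cache as purgeable before writing it, so that only the data from which the cache is derived crosses the network.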

In FIG. 3, the present disclosure provides an exemplary method 300 for designating memory as either purgeable or not purgeable. An application program interface (API) provides memory protection, preventing direct access to the memory (block 302). Access to the memory is protected by setting an access code to purgeable or non-purgeable respectively at a plurality of first storage locations that marks respectively a corresponding second storage location of the memory (block 304). The API receives the designation of the purgeable memory as an API call from the application (block 306). In response to receiving the designation, the API marks the purgeable memory by setting the access code to purgeable for a selected first storage location that corresponds to a selected second storage location comprising the purgeable memory (block 308). In response to the access code being set as purgeable, the API prevents the purgeable memory from being forwarded to the secondary host machine when the changes to the remaining portion of the memory are forwarded to the second virtual machine (block 310). In an exemplary embodiment, the API communicates the access codes for the memory to an underlying layer (e.g., Operating System (OS) or hypervisor) that performs the forwarding of changes to the second host machine to prevent forwarding purgeable memory.
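One possible representation of the access codes of method 300 is a table indexed by storage location. In this sketch the names AccessCode, MemoryProtectionApi, and filter_changes are assumptions made for illustration; the disclosure does not prescribe a particular data structure.

```python
from enum import Enum


class AccessCode(Enum):
    NON_PURGEABLE = "non-purgeable"
    PURGEABLE = "purgeable"


class MemoryProtectionApi:
    """Hypothetical API holding one access code per storage location."""

    def __init__(self, num_locations: int) -> None:
        # Each entry (a first storage location) marks the corresponding
        # memory location (a second storage location); block 304.
        self.access_codes = [AccessCode.NON_PURGEABLE] * num_locations

    def mark_purgeable(self, location: int) -> None:
        # Blocks 306 and 308: the application designates purgeable memory.
        self.access_codes[location] = AccessCode.PURGEABLE

    def filter_changes(self, changed_locations) -> list:
        # Block 310: purgeable locations are withheld from forwarding.
        return [
            location
            for location in sorted(changed_locations)
            if self.access_codes[location] is AccessCode.NON_PURGEABLE
        ]
```

The underlying layer that performs the forwarding would consult filter_changes, or an equivalent query, before sending changed locations to the secondary host machine.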

In FIG. 4, in a further exemplary aspect, a method 400 enables access by the application to the purgeable memory. The API receives an access call from the application to access the purgeable memory (block 402). In response to receiving the access call, the API registers a recovery function for the purgeable memory to provide a graceful recovery should a fault occur when attempting to access the purgeable memory (block 404). The API changes the access code to locked to protect the purgeable memory from an auto-removal policy (block 406). The API executes a user function (e.g., read, write) to access the purgeable memory specified by the access call (block 408). A determination is made as to whether a fault has occurred (block 410). For example, a fault can result from attempting to access the purgeable memory when the purgeable memory is not present. In response to a fault occurring, the API invokes the recovery function to instruct the secondary host machine to reconstruct the purgeable memory by the application on the secondary host machine (block 412). If, however, no fault occurred, then the API restores the access code to purgeable for memory protection (block 414).
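The access path of method 400 can be sketched as follows. The names PurgeableRegion, PurgeableFault, and the three-argument form of access_purgeable are assumptions for illustration; the fault is modeled as a Python exception rather than a memory fault, and this sketch retries the user function after recovery and restores the purgeable access code in both the fault and no-fault cases, which are simplifying assumptions.

```python
class PurgeableFault(Exception):
    """Raised when purgeable memory is accessed but is not present."""


class PurgeableRegion:
    """Hypothetical purgeable region; data of None models unmapped memory."""

    def __init__(self, data=None) -> None:
        self.data = data
        self.access_code = "purgeable"

    def read(self):
        if self.data is None:
            raise PurgeableFault("purgeable memory is not present")
        return self.data


def access_purgeable(region, user_function, recovery_function):
    """Run user_function on region, recovering the region if a fault occurs."""
    # Block 404: recovery_function is registered for this access.
    region.access_code = "locked"              # block 406
    try:
        try:
            return user_function(region)       # block 408
        except PurgeableFault:                 # block 410: fault detected
            recovery_function(region)          # block 412: reconstruct
            return user_function(region)       # retry after recovery
    finally:
        region.access_code = "purgeable"       # block 414
```

When the region is present, the recovery function is never called; after a simulated failover that leaves the region unmapped, the same call reconstructs the data before returning the result.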

It should be appreciated with the benefit of the present disclosure that there are several ways that an application could mark memory as purgeable. In one aspect, an API for purgeable memory can use a reference-counting approach, whereby the operating system is allowed to discard memory that is not in use by anyone, rather than paging the memory to disk. Memory that is marked as purgeable is not paged to disk when it is reclaimed by the virtual memory system because paging is a time-consuming process. Instead, the data is discarded, and if needed later, it will have to be recomputed.
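A reference-counting approach of this kind might look like the following sketch. The class RefCountedPurgeable and its method names are hypothetical, and the compute callback stands in for whatever recomputation the application performs.

```python
class RefCountedPurgeable:
    """Hypothetical purgeable buffer that may be discarded when unused."""

    def __init__(self, compute) -> None:
        self._compute = compute   # recomputes the data on demand
        self._data = compute()
        self._users = 0
        self.recomputations = 0

    def begin_access(self):
        # Recompute the data if it was discarded while nobody was using it.
        if self._data is None:
            self._data = self._compute()
            self.recomputations += 1
        self._users += 1
        return self._data

    def end_access(self) -> None:
        if self._users == 0:
            raise RuntimeError("end_access called without begin_access")
        self._users -= 1

    def try_discard(self) -> bool:
        """Called by the memory system; discards only when not in use."""
        if self._users > 0:
            return False
        self._data = None   # discarded, not paged to disk
        return True
```

The count prevents the memory system from discarding a buffer in the middle of an access, while leaving it free to discard the buffer at any other time.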

In an illustrative aspect in FIGS. 5A-5D, a series of diagrams at subsequent points in time are depicted for using an API that marks purgeable memory and handles instances of the purgeable memory becoming inaccessible. With initial reference to FIG. 5A, a mark_purgeable( ) function marks regions of memory that will not be tracked as depicted 501. At a first time t₁, in block 502 a program (application) maps its memory 504 and sets up data structures 506, 508. At a second time t₂, in block 510 an API call by the program marks data structure 508 in memory 504 as purgeable, thus creating purgeable memory 508. Thus, the purgeable memory 508 is inaccessible to normal program code. The API call further propagates data to the operating system and an underlying hypervisor to prevent forwarding purgeable memory for resource recovery.

With reference to FIG. 5B, during program execution as depicted at 511, access to the purgeable data is wrapped with a call to access_purgeable( ) which changes the memory protection, registers a recovery function, and executes a specified user function. In an initial state at a third time t₃, the purgeable memory is inaccessible to normal program code (block 512). Then at a fourth time t₄, the program makes an API call to access the purgeable memory. In block 513, in response, the API unprotects the purgeable memory 508 and calls a specified user function 514. At a fifth time t₅, in block 516 upon return, the purgeable memory 508 is rendered inaccessible again to normal program code.
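The protection behavior of FIGS. 5A and 5B can be modeled in a few lines. In this sketch the class ProgramMemory, the exception ProtectedError, and the string keys are hypothetical; real memory protection would be enforced by the operating system or hypervisor rather than by a dictionary lookup, and recovery function registration is omitted here.

```python
class ProtectedError(Exception):
    """Raised when normal program code touches protected purgeable memory."""


class ProgramMemory:
    """Hypothetical model of a program's mapped data structures."""

    def __init__(self) -> None:
        self._structures = {}
        self._purgeable = set()
        self._unprotected = set()

    def map(self, name, value) -> None:
        self._structures[name] = value          # time t1: set up structures

    def mark_purgeable(self, name) -> None:
        self._purgeable.add(name)                # time t2: mark as purgeable

    def read(self, name):
        if name in self._purgeable and name not in self._unprotected:
            raise ProtectedError(name)           # inaccessible to normal code
        return self._structures[name]

    def access_purgeable(self, name, user_function):
        self._unprotected.add(name)              # time t4: unprotect
        try:
            return user_function(self)
        finally:
            self._unprotected.discard(name)      # time t5: protect again
```

A direct read of the purgeable structure raises ProtectedError, while the same read performed inside the function passed to access_purgeable succeeds.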

With reference to FIG. 5C, in the event of failover and subsequent restart on the secondary host, purgeable memory has been unmapped and the recovery function is invoked as depicted at 517. In an initial state at a sixth time t₆, the purgeable memory is inaccessible to normal program code (block 518). Then at a seventh time t₇, the program makes an API call to access the purgeable memory. In block 519, in response to the request to access the purgeable memory, the API unprotects the purgeable memory 508 and calls a specified user function 520. During access, a fault condition 522 occurs. At an eighth time t₈ in block 524, the recovery function is invoked so that the program can reconstruct the purgeable memory 508.

With reference to FIG. 5D, if the failover occurs outside of an access_purgeable( ) call, the purgeable memory can be unmapped without invoking the recovery function, which will instead be invoked as soon as the next access_purgeable( ) call is performed as depicted at 525. At a ninth time t₉, in block 526, the purgeable memory 508 is rendered inaccessible to normal program code. Then a failover to a secondary host occurs as depicted at 528 at a tenth time t₁₀. As execution of the program on the secondary host machine commences in block 530, the memory 504′ of the secondary host is initially in the state where the essential memory 506′ has been received from the primary host machine but the purgeable memory 508′ has not. At an eleventh time t₁₁ in block 532, the API receives an API call from the program to access the purgeable memory 508′, which immediately results in detecting that the purgeable memory 508′ is no longer available/present. At a twelfth time t₁₂ in block 532, the API immediately invokes the recovery function so that the program reconstructs the purgeable memory 508′.

It should be appreciated with the benefit of the present disclosure that a failure can cause immediate failover. For example, a failure condition can include sudden power loss on the primary host machine that provides no opportunity, or requirement, for any processing to occur between the failure and resumption on the secondary host machine after failover. The secondary host machine resumes from the most recent checkpoint on the secondary system. When the program attempts to access purgeable memory, a fault occurs due to the memory not being mapped, and the recovery function is invoked. Thus, the fault in this instance occurs on the secondary host machine. In other words, in this example, the program is inside the block that accesses purgeable memory when the program fails over to the secondary. The secondary host machine resumes execution from the last checkpoint, which in this example is still inside the block that accesses purgeable memory. When an instruction attempts to read or write memory in the purgeable region, the read/write operation will cause a memory fault interrupt to be delivered. On receipt of this interrupt, the recovery function is invoked to restore the memory, and the operation on the purgeable memory is retried.
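The fault, recover, and retry sequence described above can be sketched in Python as follows. This is a simulation only: a Python exception named MemoryFault stands in for the memory fault interrupt, and the PurgeableCache class, the rebuild( ) recovery function, and the run_with_recovery( ) helper are hypothetical names introduced for illustration.

```python
class MemoryFault(Exception):
    """Models the memory fault interrupt raised for an unmapped page."""


class PurgeableCache:
    """Toy purgeable region that can be unmapped by a failover."""

    def __init__(self):
        self.mapped = True
        self.entries = {}

    def unmap(self):
        """State seen on the secondary: purgeable pages were never forwarded."""
        self.mapped = False
        self.entries = {}

    def load(self, key):
        if not self.mapped:
            raise MemoryFault(key)
        return self.entries[key]


def rebuild(cache):
    """Hypothetical recovery function: remap and recompute the contents."""
    cache.mapped = True
    cache.entries = {n: n * n for n in range(10)}


def run_with_recovery(cache, operation, recovery_fn):
    """Run operation; on a fault, invoke recovery_fn and retry once."""
    try:
        return operation(cache)
    except MemoryFault:
        recovery_fn(cache)          # restore the purgeable memory
        return operation(cache)     # retry the faulting operation


cache = PurgeableCache()
rebuild(cache)       # initial population on the primary
cache.unmap()        # failover: execution resumes on the secondary
result = run_with_recovery(cache, lambda c: c.load(7), rebuild)
```

The single retry reflects the description that the operation is retried after recovery; how repeated faults would be handled is not addressed in the description and is left out of the sketch.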

In another example, the program is not executing inside a block that accesses purgeable memory when failover occurs. The secondary host machine resumes execution from the last checkpoint, which in this example is still outside the block that accesses purgeable memory. When the program next enters a block that accesses purgeable memory, the recovery function will be immediately invoked as the purgeable memory is not available on the secondary system.
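The case in which failover occurs outside a guarded block can be sketched in Python as a presence check performed on entry to the access call. The Region class, the present flag, and the events list are assumptions used only to make the ordering observable in this simulation.

```python
class Region:
    """Toy purgeable region with a flag telling whether it is still mapped."""

    def __init__(self):
        self.present = False
        self.table = None


events = []   # records the order in which functions are invoked


def recover(region):
    """Hypothetical recovery function that reconstructs the region."""
    events.append("recovery")
    region.table = {"rebuilt": True}
    region.present = True


def access_purgeable(region, user_fn, recovery_fn):
    """On entry, detect missing purgeable memory and recover before use."""
    if not region.present:
        recovery_fn(region)       # invoked immediately, before the user function
    events.append("user")
    return user_fn(region)


region = Region()
recover(region)                   # set up on the primary
events.clear()

# Failover outside any guarded block: the secondary has no purgeable memory.
region.present = False
region.table = None

first = access_purgeable(region, lambda r: r.table["rebuilt"], recover)
second = access_purgeable(region, lambda r: r.table["rebuilt"], recover)
```

In this sketch the first access after failover triggers recovery and then runs the user function, while the second access runs the user function directly.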

In an exemplary aspect, an underlying technology such as Virtual Page Class Keys on POWER systems can be used to make that memory inaccessible to normal program code, which might not tolerate the sudden disappearance of the memory in the event of failure. The Virtual Page Class Key Protection mechanism provides one methodology and/or means to assign virtual pages to one of thirty-two (32) classes, and to modify access permissions for each class quickly by modifying the Authority Mask Register (AMR). The access permissions associated with the Virtual Page Class Key Protection mechanism apply only to data accesses, and only when address translation is enabled. An illustrative Authority Mask Register (AMR) is provided in TABLES A-B:

TABLE A

Field:      Key0   Key1   Key2   . . .   Key29   Key30   Key31
Start bit:  0      2      4              58      60      62

TABLE B

Bits          Name     Description
0:1           Key0     Access mask for class number 0
2:3           Key1     Access mask for class number 1
. . .         . . .    . . .
2n:2n + 1     Keyn     Access mask for class number n
. . .         . . .    . . .
62:63         Key31    Access mask for class number 31

The access mask for each class defines the access permissions that apply to loads and stores for which the virtual address is translated using a Page Table Entry that contains a KEY field value equal to the class number. The access permissions associated with each class are defined as follows, where AMR_(2n) and AMR_(2n+1) refer to the first and second bits of the access mask corresponding to class number n:

(a) A store is permitted if AMR_(2n)=0b0; otherwise the store is not permitted.

(b) A load is permitted if AMR_(2n+1)=0b0; otherwise the load is not permitted.

The AMR can be accessed using either Special Purpose Register (SPR) 13 or SPR 29. Access to the AMR using SPR 29 is privileged.
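The permission rules (a) and (b) can be illustrated with a short Python sketch that treats the AMR as a 64-bit integer. The sketch assumes that bit 0 is the most significant bit of the register, consistent with the bit ranges in TABLE B; the function names are hypothetical and the code only models the bit layout, it does not access any real register.

```python
AMR_BITS = 64


def amr_bit(amr: int, i: int) -> int:
    """Return AMR bit i, assuming bit 0 is the most significant bit."""
    return (amr >> (AMR_BITS - 1 - i)) & 1


def store_permitted(amr: int, n: int) -> bool:
    """Rule (a): a store to class n is permitted if AMR bit 2n is 0."""
    return amr_bit(amr, 2 * n) == 0


def load_permitted(amr: int, n: int) -> bool:
    """Rule (b): a load from class n is permitted if AMR bit 2n+1 is 0."""
    return amr_bit(amr, 2 * n + 1) == 0


def with_access_mask(amr: int, n: int, mask: int) -> int:
    """Return a copy of amr with the 2-bit access mask for class n replaced."""
    if not 0 <= n < 32 or not 0 <= mask <= 0b11:
        raise ValueError("class must be 0..31 and mask must fit in 2 bits")
    shift = AMR_BITS - 2 - 2 * n          # position of bits 2n:2n+1
    return (amr & ~(0b11 << shift)) | (mask << shift)


amr = 0                                       # all classes fully accessible
locked = with_access_mask(amr, 5, 0b11)       # class 5: no stores, no loads
read_only = with_access_mask(amr, 5, 0b10)    # class 5: loads only
```

Under these assumptions, changing the access permissions of an entire class of pages amounts to rewriting two bits, which is why the mechanism is suited to toggling protection on entry to and exit from a guarded function.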

Third, an API call allows a user-specified function to access purgeable memory. To do this, the API call must also specify a function to invoke if the purgeable memory vanishes due to failover. The API would modify the AMR on the way into and out of the specified function, so that purgeable memory could be accessed as normal within the user-specified function. This API call would be used, for example, to wrap calls to “add an item to a cache” and “lookup an item in the cache”. Any removal of the purgeable memory while the operation was in progress would call a handler function that would reinitialize the required data structures and then re-perform the operation.
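The cache use case can be sketched in Python as follows, with the “add an item” and “lookup an item” operations each wrapped by a guard that reinitializes the data structure and re-performs the operation if the memory has vanished. The Cache class, the Vanished exception, the guarded( ) helper, and the expensive( ) function are hypothetical; AMR manipulation is omitted from this simulation.

```python
class Vanished(Exception):
    """Models the purgeable memory disappearing due to failover."""


class Cache:
    """Toy cache whose slots would live in purgeable memory."""

    def __init__(self):
        self.slots = {}
        self.valid = True


def guarded(cache, operation):
    """Run operation; if the memory vanished, reinitialize and re-perform it."""
    try:
        if not cache.valid:
            raise Vanished()
        return operation(cache.slots)
    except Vanished:
        cache.slots = {}              # handler: reinitialize the data structure
        cache.valid = True
        return operation(cache.slots) # re-perform the operation


def cache_add(cache, key, value):
    guarded(cache, lambda slots: slots.__setitem__(key, value))


def cache_lookup(cache, key):
    return guarded(cache, lambda slots: slots.get(key))


def expensive(n):
    """Stand-in for a computation whose result is worth caching."""
    return sum(range(n))


def cached_expensive(cache, n):
    hit = cache_lookup(cache, n)
    if hit is None:                   # miss, including after a purge
        hit = expensive(n)
        cache_add(cache, n, hit)
    return hit


c = Cache()
before = cached_expensive(c, 100)
c.valid, c.slots = False, None        # failover: purgeable memory is gone
after = cached_expensive(c, 100)
```

Because the cached values can always be recomputed, losing the cache on failover affects only performance and not correctness, which is the property that makes such data a candidate for purgeable memory.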

It should be appreciated with the benefit of the present disclosure that use of Class Keys for page protection is not necessary, but illustrates one way of guaranteeing that no access to the purgeable memory is possible outside of the specially guarded code. In one aspect, this feature can be used during application set up (or initialization) time and switched off in production code in order to mitigate certain performance impacts.

Thus, returning to FIGS. 5A-5D, a mark_purgeable( ) function, depicted at 534, can be used to mark specific regions of memory that will not be tracked. During program execution, access to the purgeable data is wrapped with a call to access_purgeable( ) depicted at 536, which changes the memory protection, registers a recovery function and executes a specified user function.

In the event of a failover and subsequent restart on the secondary host, the purgeable memory has been unmapped and the recovery function is invoked. If the failover occurs outside an access_purgeable( ) call, the purgeable memory can be unmapped without invoking the recovery function, which will instead be invoked as soon as the next access_purgeable( ) call is performed.

The innovations herein describe the process as the process would apply to application-accessible memory. However, the same technique could be used within an operating system, to mark up regions of memory such as file caches and network buffers that could be flushed and refilled in the event of failover.

In each of the flow charts above, one or more of the methods may be embodied in a computer readable medium containing computer readable code such that a series of steps are performed when the computer readable code is executed on a computing device. In some implementations, certain steps of the methods are combined, performed simultaneously or in a different order, or perhaps omitted, without deviating from the spirit and scope of the innovation. Thus, while the method steps are described and illustrated in a particular sequence, use of a specific sequence of steps is not meant to imply any limitations on the innovation. Changes may be made with regard to the sequence of steps without departing from the spirit or scope of the present innovation. Use of a particular sequence is, therefore, not to be taken in a limiting sense, and the scope of the present innovation is defined only by the appended claims.

As will be appreciated by one skilled in the art, aspects of the present innovation may be embodied as a system, method or computer program product. Accordingly, aspects of the present innovation may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects of the present innovation may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.

Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing. Computer program code for carrying out operations for aspects of the present innovation may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or assembly level programming or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).

Aspects of the present innovation are described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the innovation. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks. The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

As will be further appreciated, the processes in embodiments of the present innovation may be implemented using any combination of software, firmware or hardware. As a preparatory step to practicing the innovation in software, the programming code (whether software or firmware) will typically be stored in one or more machine readable storage mediums such as fixed (hard) drives, diskettes, optical disks, magnetic tape, semiconductor memories such as ROMs, PROMs, etc., thereby making an article of manufacture in accordance with the innovation. The article of manufacture containing the programming code is used by either executing the code directly from the storage device, by copying the code from the storage device into another storage device such as a hard disk, RAM, etc., or by transmitting the code for remote execution using transmission type media such as digital and analog communication links. The methods of the innovation may be practiced by combining one or more machine-readable storage devices containing the code according to the present innovation with appropriate processing hardware to execute the code contained therein. An apparatus for practicing the innovation could be one or more processing devices and storage systems containing or having network access to program(s) coded in accordance with the innovation.

Thus, it is important to note that, while an illustrative embodiment of the present innovation is described in the context of a fully functional computer (server) system with installed (or executed) software, those skilled in the art will appreciate that the software aspects of an illustrative embodiment of the present innovation are capable of being distributed as a program product in a variety of forms, and that an illustrative embodiment of the present innovation applies equally regardless of the particular type of media used to actually carry out the distribution.

While the innovation has been described with reference to exemplary embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the scope of the innovation. In addition, many modifications may be made to adapt a particular system, device or component thereof to the teachings of the innovation without departing from the essential scope thereof. Therefore, it is intended that the innovation not be limited to the particular embodiments disclosed for carrying out this innovation, but that the innovation will include all embodiments falling within the scope of the appended claims. Moreover, the use of the terms first, second, etc. does not denote any order or importance, but rather the terms first, second, etc. are used to distinguish one element from another.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the innovation. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the present innovation has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the innovation in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the innovation. The embodiment was chosen and described in order to best explain the principles of the innovation and the practical application, and to enable others of ordinary skill in the art to understand the innovation for various embodiments with various modifications as are suited to the particular use contemplated. 

What is claimed is:
 1. A computer-implemented method for resource recovery, the method comprising: a processor executing an application on a primary host machine comprising a first virtual machine with the processor and a memory; receiving a designation from the application of a portion of the memory of the first virtual machine as purgeable memory, wherein the purgeable memory represents portions of memory that can be reconstructed by the application when the purgeable memory is unavailable; tracking changes to a processor state and to a remaining portion of the memory that is not designated by the application as purgeable memory; periodically stopping the first virtual machine; in response to stopping the first virtual machine, forwarding the changes to the remaining portion of the memory to a secondary host machine comprising a second virtual machine; in response to completing the forwarding of the changes, resuming execution of the first virtual machine; and in response to an occurrence of a failure condition on the first virtual machine, signaling the secondary host machine to continue execution of the application by using the forwarded changes to the remaining portion of the memory and by having the application reconstruct the purgeable memory at the second virtual machine.
 2. The method of claim 1, further comprising: protecting access to the memory by setting an access code to one of purgeable and non-purgeable, respectively, at a plurality of first storage locations that marks respectively a corresponding plurality of second storage locations of the memory; receiving the designation of the purgeable memory as an application program interface (API) call; in response to receiving the designation, marking the purgeable memory by setting the access code to purgeable for a selected first storage location that corresponds to a selected second storage location comprising the purgeable memory; and in response to the access code being set as purgeable, preventing the purgeable memory from being forwarded to the secondary host machine while the changes to the remaining portion of the memory are forwarded to the second virtual machine during a checkpoint at the first virtual machine.
 3. The method of claim 2, further comprising enabling access by the application to the purgeable memory by: receiving an access call from the application to access the purgeable memory; changing the access code to locked for memory protection for the purgeable memory to override an auto-removal policy; executing a user function to access the purgeable memory specified by the access call; and restoring the access code to purgeable for memory protection.
 4. The method of claim 3, further comprising providing an application programming interface for memory protection that enables access to memory and prevents direct access to the purgeable memory.
 5. The method of claim 3, further comprising: in response to receiving the access call, registering a recovery function for the purgeable memory; and in response to occurrence of a fault associated with the purgeable memory not being present wherein the fault is caused by the application attempting to access the purgeable memory, invoking the recovery function to instruct the secondary host machine to reconstruct the purgeable memory by the application on the secondary host machine.
 6. The method of claim 1, further comprising, in response to the occurrence of the failure condition, an operating system of the primary host machine signaling to the secondary host machine to continue execution of the application by using the forwarded changes to the remaining portion of the memory and by having the application reconstruct the purgeable memory at the secondary host machine.
 7. The method of claim 1, further comprising, in response to the occurrence of the failure condition, a hypervisor of the primary host machine signaling to the secondary host machine to continue execution of the application at the secondary host machine by using the forwarded changes to the remaining portion of the memory and by having the application reconstruct the purgeable memory. 